Coding Replications
For coding replications, whenever applicable, please follow this page or hover over the specific slides containing code chunks.
The accompanying file is in .qmd format, containing a thorough discussion of all examples that have been showcased. This file, which will be posted on eClass®, can be downloaded and replicated on your side. To do that, download the file, open it in RStudio, and render the Quarto document using the Render button (shortcut: Ctrl+Shift+K).
For most of the topics within the study of finance, there is a well-grounded, established use of statistical, economic, and mathematical concepts that set the stage for data analysis:
Back in the pre-internet era, the use of technology to support those activities was limited to a smaller set of players (e.g., hedge funds, banks, investment trusts). Nowadays, financial information is accessible to the broader public almost in real time:
Not only the availability of financial data, but also the necessary technology to process it, were among the bottlenecks for the adoption of such methods in financial practice
Nowadays, the widespread adoption of open-source technologies has helped bridge the gap towards a more inclusive environment for those methods
Despite such advances, one quickly learns that the actual implementation of models to solve problems in the area of financial economics is typically rather opaque:
It is often said that more than 80 percent of data analysis is spent on preparing data rather than analyzing it
As you start working with data, you quickly realize that you indeed spend a lot of time reading, cleaning, and transforming your data just
A note on Tidy Data
“Tidy datasets are all alike, but every messy dataset is messy in its own way. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).”
In its essence, tidy data mainly follows three principles:
In addition to the data layer, there are also tidy coding principles outlined in the tidy tools manifesto that we’ll try to follow:
Luckily, the community has already taken a stab at creating tools and organizing a unified approach towards tidy analysis
Amongst a diverse set of options for tidy data manipulation, the tidyverse contains packages that follow a unified approach
The tidyverse is an opinionated collection of packages designed for data science
All packages share an underlying design philosophy, grammar, and data structures
It is supported by Posit, the maintainer of RStudio and R’s largest open-source contributor1
You can install the complete tidyverse using:
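A minimal sketch of the installation step, assuming a CRAN mirror is reachable from your machine. The requireNamespace() guard is an addition for convenience, so that re-running the chunk does not reinstall the package:

```r
# Install the tidyverse from CRAN only if it is not already available
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
```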
To load the tidyverse in your session, simply run:
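A minimal sketch of the loading step, assuming the package has already been installed. The call to tidyverse_packages() is optional and only lists what the meta-package bundles:

```r
# Attach the core tidyverse packages in one call
library(tidyverse)

# List every package that belongs to the tidyverse meta-package
tidyverse_packages()
```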
dplyr
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
mutate() adds new variables that are functions of existing variables
select() picks variables based on their names
filter() picks cases based on their values
summarise() reduces multiple values down to a single summary
arrange() changes the ordering of the rows
Key Highlights
All verbs combine with group_by(), allowing users to perform operations groupwise
The pipe operator, %>%, increases code readability and reproducibility
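The verbs above can be chained in a single pipeline. The sketch below is only an illustration: the tickers, prices, and the names prices and summary_tbl are hypothetical, not data from the course.

```r
library(dplyr)

# Hypothetical daily closing prices for two tickers (illustrative values only)
prices <- tibble(
  symbol = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  date   = as.Date(c("2024-01-02", "2024-01-03", "2024-01-04",
                     "2024-01-02", "2024-01-03", "2024-01-04")),
  close  = c(100, 102, 101, 50, 49, 51),
  volume = c(10, 12, 9, 20, 18, 25)
)

summary_tbl <- prices %>%
  select(symbol, date, close) %>%                    # keep only the columns we need
  group_by(symbol) %>%                               # run the next steps per ticker
  mutate(ret = close / lag(close) - 1) %>%           # simple daily return
  filter(!is.na(ret)) %>%                            # drop the first day of each ticker
  summarise(mean_ret = mean(ret), n_days = n()) %>%  # one row per ticker
  arrange(desc(mean_ret))                            # highest average return first

summary_tbl
```

Because group_by() comes before mutate(), lag() never reaches across tickers, so the first return of each ticker is missing rather than computed against another asset's price.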
The core tidyverse includes the packages that you’re likely to use in everyday data analyses. As of its 1.3.0 version, the following packages are included in the core tidyverse:
ggplot2
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics
You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details
Key Highlights
ggplot2 has a rich ecosystem of extensions, ranging from annotations and interactive visualizations to specialized genomics; click here for a community-maintained list
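The "data, aesthetic mapping, graphical primitive" recipe can be sketched as follows. The data frame prices and its values are hypothetical and serve only to make the chunk self-contained:

```r
library(ggplot2)

# Hypothetical price series for two tickers (illustrative values only)
prices <- data.frame(
  symbol = rep(c("AAA", "BBB"), each = 4),
  date   = rep(as.Date("2024-01-01") + 0:3, times = 2),
  close  = c(100, 102, 101, 104, 50, 49, 51, 52)
)

# Data + aesthetic mapping + geometric primitive; ggplot2 handles the rest
p <- ggplot(prices, aes(x = date, y = close, colour = symbol)) +
  geom_line() +
  labs(title = "Closing prices", x = NULL, y = "Close", colour = "Ticker") +
  theme_minimal()

p
```

Swapping geom_line() for another primitive, such as geom_point(), changes the chart type without touching the data or the mapping.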
tidyr
The goal of tidyr is to help you create tidy data. Tidy data is data where:
Every column is a variable
Every row is an observation
Every cell is a single value
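A minimal sketch of moving between a "wide" layout and a tidy "long" layout with tidyr. The table wide and its values are hypothetical:

```r
library(tidyr)

# Hypothetical "wide" table: one column per ticker (illustrative values only)
wide <- data.frame(
  date = as.Date("2024-01-01") + 0:2,
  AAA  = c(100, 102, 101),
  BBB  = c(50, 49, 51)
)

# Tidy ("long") form: one row per date-ticker observation
long <- pivot_longer(wide, cols = c(AAA, BBB),
                     names_to = "symbol", values_to = "close")

# And back to the wide layout
wide_again <- pivot_wider(long, names_from = symbol, values_from = close)
```

The long form is the one most tidyverse functions expect, for example when grouping by symbol or mapping symbol to a colour in a plot.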
Everybody who has experience working with data is also familiar with storing and reading data in formats like .csv, .xls, .xlsx, or other delimited value storage
However, if your goal is to replicate a common task at a predefined time interval, like charting weekly stock prices for a selected bundle of stocks every end-of-week, it might be overwhelming to manually perform these tasks every week
In what follows, we’ll dive into the various sources of financial data, both global and specific to the Brazilian financial markets, that can be directly fed into your R session:
So… you have been given the task of collecting daily stock price information for a subset of the U.S. Big Techs. How should you do it?
In a nutshell, Yahoo! Finance is your go-to guy:
Highlights: free, quick and easy to set up, with an impressive range of data containing stock prices, dividends, and splits. There is an extensive list of R packages that can be used to retrieve Yahoo! Finance information, including, but not limited to, tidyquant, quantmod, and yfR
Drawbacks: its API is no longer a fully official API. As a consequence, solutions typically used to retrieve information may stop working if Yahoo! Finance changes its structure. Importantly, data is not in real time: it often comes with a 15-minute delay (see here)
Below, you can find an example of how to use tq_get(), from the tidyquant package, to download both single and multiple stock price information
Data is stored in a convenient way that allows users to manipulate data seamlessly - hit Download Data and see what the output would look like in Excel format
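A minimal sketch of the tq_get() call described above, assuming tidyquant is installed and Yahoo! Finance is reachable. The wrapper get_prices(), the tickers, and the date range are hypothetical choices made for illustration; the example calls are left commented out because they need an internet connection:

```r
library(tidyquant)

# Hypothetical helper: daily prices from Yahoo! Finance for one or more tickers
get_prices <- function(tickers, from, to) {
  tq_get(tickers, get = "stock.prices", from = from, to = to)
}

# Single ticker:
# aapl <- get_prices("AAPL", from = "2024-01-01", to = "2024-06-30")

# Multiple tickers; a symbol column identifies each asset in the stacked result:
# big_techs <- get_prices(c("AAPL", "MSFT", "GOOG"),
#                         from = "2024-01-01", to = "2024-06-30")
```

The result comes back in long format, with one row per ticker and date, so it can be passed directly to group_by(symbol) for per-asset calculations.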
Important
Yahoo! Finance provides Open, High, Low, Close, and Adjusted Close trading prices for each asset that is being tracked, where Adjusted Close is the closing price adjusted for dividends and stock splits. If you are using R, Python, or any API to pull this data, make sure to use the information adjusted for dividends and splits.
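To see why the adjustment matters, consider a stylized 2-for-1 split. The numbers below are hypothetical and chosen only so that the arithmetic is easy to follow:

```r
library(dplyr)

# Hypothetical prices around a 2-for-1 split on day 3 (illustrative values only)
px <- tibble(
  day      = 1:4,
  close    = c(200, 202, 101, 102),  # raw close halves on the split date
  adjusted = c(100, 101, 101, 102)   # history restated on the post-split basis
)

returns <- px %>%
  mutate(
    ret_close    = close / lag(close) - 1,       # distorted by the split
    ret_adjusted = adjusted / lag(adjusted) - 1  # reflects the investor's actual return
  )

returns
```

On day 3 the raw close implies a return of -50%, while the adjusted series shows 0%: the investor holds twice as many shares at half the price, so nothing was lost.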
Apart from price-level information, there are plenty of available resources to efficiently download the most commonly used macroeconomic variables directly within an R session:
FRED, for free
\(\rightarrow\) Related packages: tidyquant, FredR, quantmod, and Quandl
\(\rightarrow\) Related packages: wbids
\(\rightarrow\) Related packages: ecb
#Load the ecb package
library(ecb)
#Get information of headline and core inflation for Eurozone countries
key <- "ICP.M.DE+FR+ES+IT+NL+U2.N.000000+XEF000.4.ANR"
#Get the latest 12 observations
filter <- list(lastNObservations = 12, detail = "full")
#Retrieve the data
hicp <- get_data(key, filter)
#Parse time component to proper format
hicp$obstime <- convert_dates(hicp$obstime)
\(\rightarrow\) For full details and implementation of the R package ecb, click here
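For the FRED route listed above, one option is tq_get() from tidyquant with get = "economic.data". This is a sketch under assumptions: the wrapper get_fred() and the date range are hypothetical, and the example call is commented out because it needs an internet connection. CPIAUCSL is the FRED series ID for the U.S. Consumer Price Index for All Urban Consumers.

```r
library(tidyquant)

# Hypothetical helper: pull one or more FRED series by their series IDs
get_fred <- function(series_ids, from, to) {
  tq_get(series_ids, get = "economic.data", from = from, to = to)
}

# Example call:
# cpi <- get_fred("CPIAUCSL", from = "2015-01-01", to = "2024-12-31")
```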
R session through the provider’s official API
Bloomberg: the Rblpapi package provides access to data and calculations from Bloomberg
Refinitiv Eikon: the DatastreamDSWS2R package provides a set of functions and a class to connect, extract and upload information from the LSEG Datastream database
Quandl: publishes free/paid data, scraped from many different sources on the web. The Quandl package can be used to retrieve data
Simfin: fundamental financial data freely available to private investors, researchers, and students. The simfinapi package can be used to retrieve data
FMP: accurate financial data (balance sheets, income statements, etc.), with historical information dating back 30 years. The fmpapi package can be used to retrieve data
R packages:
rbcb
GetTDData
crypto2
alphavantager
Wrapping up on data providers
While some data providers offer an official API for developers, other solutions rely on scraping historical data from the web. As such, some solutions can become deprecated after some time if, for example, access is blocked. It is always important to check whether an R package has been deprecated by looking into the Comprehensive R Archive Network (CRAN) or its GitHub repository.
purrr
The goal of purrr is to enhance R’s functional programming toolkit by providing a complete and consistent set of tools for working with functions and vectors
purrr takes care of the nitty-gritty details
Key Highlights
It seamlessly integrates with all tidyverse packages and functions, allowing users to apply functional programming in the most straightforward way possible
Simplifies the code pipeline to solve fairly realistic problems - e.g., estimating the CAPM for 100+ industries where we have a different number of observations per industry
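The CAPM-by-industry idea mentioned above can be sketched with a nest-then-map pattern. Everything in the chunk is an assumption made for illustration: the returns are simulated with set.seed(), the three industry names are hypothetical, and the true beta of 1.2 is simply the value used to generate the data.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)

# Simulated excess returns for three hypothetical industries with
# different sample sizes (illustrative data, not real returns)
sim <- tibble(
  industry   = rep(c("Banks", "Retail", "Tech"), times = c(60, 48, 36)),
  mkt_excess = rnorm(144, mean = 0.005, sd = 0.04)
) %>%
  mutate(ret_excess = 0.001 + 1.2 * mkt_excess + rnorm(n(), sd = 0.02))

# One CAPM regression per industry: nest the data, then map lm() over it
capm <- sim %>%
  nest(data = -industry) %>%
  mutate(
    fit   = map(data, ~ lm(ret_excess ~ mkt_excess, data = .x)),
    alpha = map_dbl(fit, ~ coef(.x)[["(Intercept)"]]),
    beta  = map_dbl(fit, ~ coef(.x)[["mkt_excess"]]),
    n_obs = map_int(data, nrow)
  ) %>%
  select(industry, alpha, beta, n_obs)

capm
```

The same pipeline works unchanged whether there are 3 or 100+ groups, and each group may have its own number of observations, because map() simply applies the regression to whatever data each nested row holds.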
readr
The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (.csv) and tab-separated values (.tsv)
Key Highlights
Is generally much faster than base R functions (up to 10x-100x), depending on the dataset
All functions work exactly the same way regardless of the current locale (e.g., thousands and decimal separators)
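Since readr does not pick up the machine's locale, conventions such as a decimal comma are declared explicitly. The sketch below assumes readr 2.0 or later, where I() marks a string as literal data; the ticker ABCD3 and its prices are hypothetical.

```r
library(readr)

# Hypothetical semicolon-delimited content using a decimal comma
# (illustrative values only)
raw <- "date;ticker;close\n2024-01-02;ABCD3;37,25\n2024-01-03;ABCD3;37,90\n"

# I() tells readr the string is literal data rather than a file path
px <- read_delim(
  I(raw),
  delim = ";",
  locale = locale(decimal_mark = ",", grouping_mark = "."),
  col_types = cols(
    date   = col_date(),
    ticker = col_character(),
    close  = col_double()
  )
)

px
```

Declaring col_types up front makes the import reproducible: the same file yields the same column types on every machine, regardless of its regional settings.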
tibble
The tibble package provides a modern reimagining of a data.frame, keeping what time has proven to be effective, and throwing out what is not
Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating
It is a nice way to create data frames. It encapsulates best practices for data frames and handles various data formats in an easier way
Key Highlights
Tibbles have an enhanced print() method which makes them easier to use with large datasets containing complex objects.
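A short sketch of three common ways to create a tibble. The tickers, prices, and sectors are hypothetical; mtcars is a dataset that ships with base R.

```r
library(tibble)

# Column-wise construction; later columns can refer to earlier ones
tb <- tibble(
  ticker = c("AAA", "BBB", "CCC"),
  price  = c(10, 20, 30),
  value  = price * 100
)

# Row-wise construction, handy for small lookup tables
sectors <- tribble(
  ~ticker, ~sector,
  "AAA",   "Tech",
  "BBB",   "Banks"
)

# Convert an existing data.frame, keeping row names as an explicit column
cars_tbl <- as_tibble(head(mtcars, 3), rownames = "model")

cars_tbl
```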
stringr
The stringr package provides a cohesive set of functions designed to make working with strings (e.g., qualitative data, such as stock tickers, names, etc.) as easy as possible:
str_detect() tells you if there’s any match to the pattern
str_locate() gives the position of the match
str_count() counts the number of matches
str_subset() extracts the matching components
str_extract() extracts the text of the match
str_match() extracts parts of the match defined by parentheses
str_replace() replaces the matches with new text
str_split() splits up a string into multiple pieces
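The functions listed above can be tried on a small vector of ticker strings. The "SYMBOL.EXCHANGE" layout used here is an assumption chosen for illustration:

```r
library(stringr)

# Example ticker strings in a "SYMBOL.EXCHANGE" layout (assumed format)
x <- c("AAPL.US", "PETR4.SA", "VALE3.SA", "MSFT.US")

str_detect(x, "\\.SA$")        # is there a match?
str_locate(x, "\\.")           # start and end position of the first match
str_count(x, "[0-9]")          # number of matches per string
str_subset(x, "\\.SA$")        # keep only the matching strings
str_extract(x, "^[A-Z]+")      # text of the first match
str_match(x, "^(.+)\\.(.+)$")  # full match plus each parenthesised group
str_replace(x, "\\.SA$", "")   # replace the match with new text
str_split(x, "\\.")            # split into pieces (returns a list)
```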
forcats
The goal of the forcats package is to provide a suite of tools that solve common problems with factors, variables that have a fixed and known set of possible values (e.g., a vector that contains all possible days in a week)
fct_reorder() reorders a factor by another variable
fct_infreq() reorders a factor by the frequency of values
fct_relevel() changes the order of a factor by hand
fct_lump() collapses the least/most frequent values of a factor into a consolidated group
Key Highlights
factor handles several issues regarding inserting new data
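A small sketch of the four functions listed above, applied to a factor of weekdays. The vectors day and pnl hold hypothetical values used only to show how the level order changes:

```r
library(forcats)

# Hypothetical trades tagged by weekday, with a profit-and-loss figure each
day <- factor(c("Mon", "Tue", "Tue", "Wed", "Wed", "Wed", "Fri"))
pnl <- c(1, 5, 7, -2, -1, 0, 3)

levels(day)                                           # alphabetical by default
levels(fct_infreq(day))                               # most frequent level first
levels(fct_reorder(day, pnl, .fun = median))          # ordered by median pnl
levels(fct_relevel(day, "Mon", "Tue", "Wed", "Fri"))  # order set by hand
levels(fct_lump(day, n = 2))                          # 2 most frequent kept, rest lumped
```

The level order matters mostly downstream: it drives the order of bars and legend entries in ggplot2 and the reference category in regression models.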